Introduction

Accurately testing for marijuana usage is very important in different everyday scenarios. For example, THC in marijuana can affect an individuals motor skills, depth perception, and overall cognition. This then can hinder their ability to work effectively and safely. According to the National Institute on Drug Abuse, it is noted that employees that tested positive for marijuana had an increase of 55% when it came to workplace accidents and that they were responsible for another increase of 85% of work-related injuries 1. These liabilities can hurt the company and more importantly, the individual; hence why it is paramount for companies to run effective drug tests on their employees. Another example why finding out which compound and matrix is most effective to use when conducting a drug test is for scenarios in which we’d like to find out if a driver is under the influence or not. Quoted directly from the National Highway Traffic Safety Administration, “In the 2013-2014 survey 2, 12.6 percent of weekend nighttime drivers tested positive for marijuana. That’s a 48-percent increase in less than 10 years” 3.

All of this information is quite alarming and that is why we want to figure out which compound and matrix is the most effective to analyze when trying to figure out if an individual is under the influence or not. It is also worth mentioning that we’d like to go deeper with this study by trying to find out if some of these compounds are more sensitive to higher or lower doses of marijuana. The relationship between these variables can provide us with substantial information in regards to figuring out if some compounds are worth paying more attention to than others. This ultimately saves the tester a lot of time when they’re running a drug test on an individual.

For our case study, the primary focus is to figure out which compound from which matrix (whole blood, breath, oral fluid) is the best bio-marker for recent use, with “recent use” definied to be within the last 3 hours.

Load packages

library(tidymodels)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(janitor)
library(purrr)
library(rstatix)
library(cowplot)

Question

  • Which compound, in which matrix, and at what cutoff is the best bio-marker of recent use? (within 3h)
  • what’s the optimal model for the relationship between specific compounds?

The Data

The data we will be using comes from a placebo-controlled, double-blinded, randomized study published by Hoffman et. al. (2021). In the study, volunteers were randomly assigned to smoke a ciggertte containing placebo (0.02%), or 5.9% (low dose) or 13.4% (high dose) THC. Samples of whole blood (WB), oral fluid(OF), breath (BR) are taken before and after smoking.

Data Import

WB = read.csv("data/Blood.csv")
BR = read.csv("data/Breath.csv")
OF = read.csv("data/OF.csv")

Data Wrangling

We re-coded and re-leveled variables (Treatment and Group), cleans column names, and renames specific columns (x11_oh_thc to thcoh, thc_v to thcv, thccooh_gluc to thc_cooh_gluc, and thccooh to thc_cooh) in all 3 tables. Using janitor package to organized column names.

OF <- OF |>
  mutate(Treatment = fct_recode(Treatment, 
                                "5.9% THC (low dose)" = "5.90%",
                                "13.4% THC (high dose)" = "13.40%"),
         Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)"),
         Group = fct_recode(Group, 
                            "Occasional user" = "Not experienced user",
                            "Frequent user" = "Experienced user" )) |>  
  janitor::clean_names() |>
  rename(thcoh = x11_oh_thc,
         thcv = thc_v)

WB <- WB |> 
  mutate(Treatment = fct_recode(Treatment, 
                                "5.9% THC (low dose)" = "5.90%",
                                "13.4% THC (high dose)" = "13.40%"),
         Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)")) |> 
  janitor::clean_names() |>
  rename(fluid = fluid_type,
         thcoh = x11_oh_thc,
         thccooh = thc_cooh,
         thccooh_gluc = thc_cooh_gluc,
         thcv = thc_v)

BR <- BR |> 
  mutate(Treatment = fct_recode(Treatment, 
                                "5.9% THC (low dose)" = "5.90%",
                                "13.4% THC (high dose)" = "13.40%"),
         Treatment = fct_relevel(Treatment, "Placebo", "5.9% THC (low dose)"),
         Group = fct_recode(Group, 
                            "Occasional user" = "Not experienced user",
                            "Frequent user" = "Experienced user" )) |> 
  janitor::clean_names() |> 
  rename(thc = thc_pg_pad)


compounds_WB <-  as.list(colnames(Filter(function(x) !all(is.na(x)), WB[6:13])))
compounds_BR <-  as.list(colnames(Filter(function(x) !all(is.na(x)), BR[6])))
compounds_OF <-  as.list(colnames(Filter(function(x) !all(is.na(x)), OF[6:12])))

Then we created 3 tables based on specific minutes and labeled accordingly, covering pre-smoking and subsequent post-smoking time periods for blood, breath, and oral fluid data.

timepoints_WB <- tibble(
  start = c(-400, 0, 30, 70, 100, 180, 210, 240, 270, 300),
  stop = c(
    0,
    30,
    70,
    100,
    180,
    210,
    240,
    270,
    300,
    max(WB$time_from_start, na.rm = TRUE)
  ),
  timepoint = c(
    "pre-smoking",
    "0-30 min",
    "31-70 min",
    "71-100 min",
    "101-180 min",
    "181-210 min",
    "211-240 min",
    "241-270 min",
    "271-300 min",
    "301+ min"
  )
)

timepoints_BR <- tibble(
  start = c(-400, 0, 40, 90, 180, 210, 240, 270),
  stop = c(
    0,
    40,
    90,
    180,
    210,
    240,
    270,
    max(BR$time_from_start, na.rm = TRUE)
  ),
  timepoint = c(
    "pre-smoking",
    "0-40 min",
    "41-90 min",
    "91-180 min",
    "181-210 min",
    "211-240 min",
    "241-270 min",
    "271+ min"
  )
)

timepoints_OF <- tibble(
  start = c(-400, 0, 30, 90, 180, 210, 240, 270),
  stop = c(0, 30, 90, 180, 210, 240, 270,
           max(OF$time_from_start, na.rm = TRUE)),
  timepoint = c(
    "pre-smoking",
    "0-30 min",
    "31-90 min",
    "91-180 min",
    "181-210 min",
    "211-240 min",
    "241-270 min",
    "271+ min"
  )
)

assign_timepoint <- function(x, timepoints) {
  if (!is.na(x)) {
    timepoints$timepoint[x > timepoints$start & x <= timepoints$stop]
  } else{
    NA
  }
}

We created a new column, timepoint_use, in each table by mapping the time_from_start values to specific timepoints defined in separate reference data frames (timepoints_WB, timepoints_OF, timepoints_BR). Finally, re-leveled the timepoint_use factor variable to align with the order specified in the reference data frames. This ensures consistent and meaningful timepoint labels for subsequent analyses or visualizations in the study.

 WB <- WB |> 
  mutate(timepoint_use = map_chr(time_from_start, 
                                 assign_timepoint, 
                                 timepoints=timepoints_WB),
         timepoint_use = fct_relevel(timepoint_use, timepoints_WB$timepoint))

OF <- OF |> 
  mutate(timepoint_use = map_chr(time_from_start, 
                                 assign_timepoint, 
                                 timepoints=timepoints_OF),
         timepoint_use = fct_relevel(timepoint_use, timepoints_OF$timepoint))

BR <- BR |> 
  mutate(timepoint_use = map_chr(time_from_start, 
                                 assign_timepoint, 
                                 timepoints=timepoints_BR),
         timepoint_use = fct_relevel(timepoint_use, timepoints_BR$timepoint))

Next, we will remove duplicate ID’s.

WB <- drop_dups(WB)
OF <- drop_dups(OF)
BR <- drop_dups(BR)


#im saving a copy of this clean OF for the extended question section. 
OF_ex = drop_dups(OF)

EDA

Compounds Measurements Over Time By Treatment

The following plots include of all compounds against time, distinguished by color according to their respective groups. To achieve a comprehensive understanding, we generated scatterplots for compounds across three distinct matrices—namely, whole blood, oral fluid, and breath. This analysis encompasses various time points and considers different treatments, namely, placebo, low dose, and high dose.

Upon close examination of the scatter plots, a noteworthy observation emerges, particularly concerning the THC biomarker in whole blood. This specific biomarker appears to offer a potentially enhanced indication of recent cannabis joint usage. The scatter plot reveals a discernible separation between the placebo and THC treatment groups, suggesting that the THC measurement in whole blood may serve as a more reliable indicator of recent cannabis joint consumption.

scatter_WB <- map(compounds_WB, ~ compound_scatterplot_group_by_treatment( 
    dataset=WB, 
    compound=.x, 
    timepoints=timepoints_WB))

scatter_OF <- map(compounds_OF, ~ compound_scatterplot_group_by_treatment( 
    dataset=OF, 
    compound=.x, 
    timepoints=timepoints_OF))

scatter_BR <- map(compounds_BR, ~ compound_scatterplot_group_by_treatment( 
    dataset=BR, 
    compound=.x, 
    timepoints=timepoints_BR))

In the presented set of scatter plots, all compounds are graphically depicted against time, with color distinctions denoting different treatment conditions and a log transformation applied to the y-axis, which represents the respective compound measurements. A comparative analysis with the previous scatterplots reveals a modification: specifically, a log transformation has been applied to the y-axis, providing an alternative perspective on the measurement of the compounds.

Upon closer examination, a notable observation emerges. The measurement of THC from breath exhibits a more discernible separation between the placebo and THC treatment groups in the log-transformed scatterplots. This suggests that the log transformation on the y-axis enhances the visibility of distinctions between the treatment conditions for THC. The log transformation, by compressing the scale, may unveil nuances and patterns that are not as apparent on a linear scale. This nuanced insight into THC measurements underscores the importance of considering the impact of transformation techniques when analyzing compound data over time in the context of different treatments. The enhanced separation observed in the log-transformed scatterplots could potentially provide valuable insights into the effects of treatments on THC levels and underscores the sensitivity of the chosen visualization approach.

scatter_WB_by_treatment <- map(compounds_WB, ~ compound_scatterplot_group_by_treatment_log( 
    dataset=WB, 
    compound=.x, 
    timepoints=timepoints_WB))

scatter_OF_by_treatment <- map(compounds_OF, ~ compound_scatterplot_group_by_treatment_log( 
    dataset=OF, 
    compound=.x, 
    timepoints=timepoints_OF))

scatter_BR_by_treatment <- map(compounds_BR, ~ compound_scatterplot_group_by_treatment_log( 
    dataset=BR, 
    compound=.x, 
    timepoints=timepoints_BR))

Now we will delete compounds that do not work from the compound data frame such as:

WB: CBD, THCCOOH, THCCOOH_GLUC, THCV OF: THCOH

compounds_WB = compounds_WB[- c(2, 5, 6, 8)]
compounds_OF = compounds_OF[- c(4)]

Analysis

Calculating the Sensitivity and Specificity.

output_WB <- map_dfr(compounds_WB,
                     ~ sens_spec_cpd(
                       dataset = WB,
                       cpd = all_of(.x),
                       timepoints =  timepoints_WB
                     )) |> clean_gluc()

output_BR <- map_dfr(compounds_BR, 
                     ~ sens_spec_cpd(
                       dataset = BR,
                       cpd = all_of(.x),
                       timepoints = timepoints_BR
                     ))  |> clean_gluc()

output_OF <- map_dfr(compounds_OF,
                     ~ sens_spec_cpd(
                       dataset = OF,
                       cpd = all_of(.x),
                       timepoints = timepoints_OF
                     ))  |> clean_gluc()


output_WB
## # A tibble: 2,470 × 17
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int>           <dbl> <chr>         <dbl>     <dbl>
##  1     0     0     1   188           0.5   CBN            -400         0
##  2     0     0     0   189           0.6   CBN            -400         0
##  3     0     0     0   189           0.7   CBN            -400         0
##  4     0     0     0   189           0.753 CBN            -400         0
##  5     0     0     0   189           0.8   CBN            -400         0
##  6     0     0     0   189           0.831 CBN            -400         0
##  7     0     0     0   189           0.9   CBN            -400         0
##  8     0     0     0   189           0.909 CBN            -400         0
##  9     0     0     0   189           1     CBN            -400         0
## 10     0     0     0   189           1.1   CBN            -400         0
## # ℹ 2,460 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>
output_BR
## # A tibble: 816 × 17
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int>           <dbl> <chr>         <dbl>     <dbl>
##  1     0     0     6   185             0.5 THC            -400         0
##  2     0     0     6   185            80.1 THC            -400         0
##  3     0     0     6   185            85.6 THC            -400         0
##  4     0     0     6   185            86.4 THC            -400         0
##  5     0     0     6   185            91.8 THC            -400         0
##  6     0     0     6   185            93.8 THC            -400         0
##  7     0     0     6   185            98.2 THC            -400         0
##  8     0     0     6   185           105.  THC            -400         0
##  9     0     0     6   185           105.  THC            -400         0
## 10     0     0     6   185           107.  THC            -400         0
## # ℹ 806 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>
output_OF
## # A tibble: 3,928 × 17
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int>           <dbl> <chr>         <dbl>     <dbl>
##  1     0     0     5   187            0.5  CBN            -400         0
##  2     0     0     8   184            0.4  CBN            -400         0
##  3     0     0     5   187            0.48 CBN            -400         0
##  4     0     0     3   189            0.6  CBN            -400         0
##  5     0     0     3   189            0.7  CBN            -400         0
##  6     0     0     3   189            0.8  CBN            -400         0
##  7     0     0     3   189            0.9  CBN            -400         0
##  8     0     0     1   191            1    CBN            -400         0
##  9     0     0     1   191            1.1  CBN            -400         0
## 10     0     0     1   191            1.16 CBN            -400         0
## # ℹ 3,918 more rows
## # ℹ 9 more variables: time_window <chr>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>

Cutoff vs. Sensitivity/Specificity

Here we plot the value of the cutoff against sensitivity and specificity for every compound in every matrix, and arrange them all into one big plot. This is also known as the ROC curve of sensitivity and specificity against cutoff values suggests an exploration of optimal cutoff points. Overall, the specificity of all compounds increases when detection limit rises. On the other hand, sensitivity drops to zero when detection limit rises.

#arranges ss plots into one
ss_bottom_row <-
  plot_grid(
    ss_OF,
    ss_BR,
    labels = c('B', 'C'),
    label_size = 12,
    ncol = 2,
    rel_widths = c(0.66, .33)
  )
plot_grid(
  ss_WB,
  ss_bottom_row,
  labels = c('A', ''),
  label_size = 12,
  ncol = 1
)

Average Sensitivity and Specificity vs. Detection Limit

### Average sensitivity and specificity vs. detection limit
output_WB_avg = average_sens_spec(output = output_WB)
output_OF_avg = average_sens_spec(output = output_OF)
output_BR_avg = average_sens_spec(output = output_BR)

ss_WB_avg_together <-
  ss_plot_avg_together(output_WB_avg, tpts = length(unique(output_WB$time_start)), tissue = "Blood")

ss_OF_avg_together <-
  ss_plot_avg_together(output_OF_avg, tpts = length(unique(output_WB$time_start)), tissue = "Oral Fluid")

ss_BR_avg_together <-
  ss_plot_avg_together(output_BR_avg, tpts = length(unique(output_WB$time_start)), tissue = "Breath")

It should be apparent that OF-THC is the superior choice. Now we will dig deeper into OF-THC and find the specific cutoff. Referring back to the average sensitivity and specificity vs. detection limit plot, we see that the detection limit is extremely close to 0 when both sensitivity and specificity are high. Now we will try out some more cutoff values near 0.

To start, we will remove every compound where the average sensitivity and specificity does not intersect. The reason for doing so is that for compounds with no intersection, optimal sensitivity, the left most point of the graph, is where specificity is the worst. There is no room for adjustment there because if we tried to integrate further testing, it would decline progressively.

compounds_WB = c("thc")
compounds_OF = c("thc")
compounds_BR = NULL

Sensitivity vs. Specificity

In this visual representation, we graph the sensitivity against specificity for each compound within every matrix, consolidating the data into a comprehensive plot. This collective visualization allows for a convenient comparison of the performance of various bio markers concerning their specificity and sensitivity.

output_WB <- map_dfr(compounds_WB,
                     ~ sens_spec_cpd(
                       dataset = WB,
                       cpd = all_of(.x),
                       timepoints =  timepoints_WB
                     )) |> clean_gluc()


output_OF <- map_dfr(compounds_OF,
                     ~ sens_spec_cpd(
                       dataset = OF,
                       cpd = all_of(.x),
                       timepoints = timepoints_OF
                     ))  |> clean_gluc()

#plot sensitivity vs. specificity
roc_WB = roc_plot(output_WB, tpts = length(unique(output_WB$time_start)), tissue = "Blood")

roc_OF = roc_plot(output_OF, tpts = length(unique(output_OF$time_start)), tissue = "Oral Fluid")

Plot Sensitivity and Speciticity Over Time Given Specific Cutoffs

It should be apparent that OF-THC is the superior choice. Now we will dig deeper into the relationship between the two and find the specific cutoff value by referring back to the Average Sensitivity and Specificity vs. Detection Limit Plot. We see that the detection limit is again, very close to 0 when both sensitivity and specificity are greater in value. Similarly to before, we will try out some more cutoff values that are close to 0.

Now we will take a deeper dive into sensitivity and specificity over time over time for the measurement of THC in Blood and Oral Fluid tissues. In direct comparison between the two measurement methods of THC, it becomes evident that Oral Fluid outshines its counterpart in terms of both sensitivity and specificity, particularly within the critical time span of three hours post-smoking.

#pass specific cutoff into splits parameter
OF_THC <- sens_spec_cpd(
  dataset = OF,
  cpd = 'thc',
  timepoints = timepoints_OF,
  splits =  c(0.5, 1, 2, 5, 10)
) |> clean_gluc()

of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
               "91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
               "241-270\nmin\nN=90",  "271+\nmin\nN=76")

plot_cutoffs(dataset=OF_THC, 
             timepoint_use_variable=OF$timepoint_use, 
             tissue="Oral Fluid", 
             cpd="THC", 
             x_labels=NULL)
## [[1]]

## 
## [[2]]
## # A tibble: 40 × 18
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int> <fct>           <chr>         <dbl>     <dbl>
##  1     0     0    35   157 0.5             THC            -400         0
##  2     0     0    20   172 1               THC            -400         0
##  3     0     0     9   183 2               THC            -400         0
##  4     0     0     0   192 5               THC            -400         0
##  5     0     0     0   192 10              THC            -400         0
##  6   129     0    39    24 0.5             THC               0        30
##  7   129     0    30    33 1               THC               0        30
##  8   128     1    19    44 2               THC               0        30
##  9   128     1     3    60 5               THC               0        30
## 10   125     4     1    62 10              THC               0        30
## # ℹ 30 more rows
## # ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>, my_label <fct>

The average sensitivity is a lot more sensitive to change than the average specificity; specificity only dips in the 31 to 90 minute window when the cutoff is lowered. In contrary, a lower cutoff increases overall sensitivity all across the board, no matter the time.

Additionally, this 31 to 90 minute window where the specificity is heavily affected by a low cutoff is trivial. It should be quite apparent that some is under the influence if they smoked within the last 90 minutes. A lowered specifcity in this time frame doesn’t raise too much concern, considering how much sensitivity is gained when using a low cutoff.

To wrap it all up, a low cutoff is optimal, approximately somewhere in between 0 to 2. Now we will test more cutoffs in this range.

OF_THC <- sens_spec_cpd(
  dataset = OF,
  cpd = 'thc',
  timepoints = timepoints_OF,
  splits =  c(0.1, 0.25, 0.5, 1, 1.5)
) |> clean_gluc()

blood_levels <- c("pre-smoking\nN=189", "0-30\nmin\nN=187", "31-70\nmin\nN=165",
                  "71-100\nmin\nN=157", "101-180\nmin\nN=168", "181-210\nmin\nN=103",
                  "211-240\nmin\nN=127", "241-270\nmin\nN=137", "271-300\nmin\nN=120",
                  "301+\nmin\nN=88")

of_levels <- c("pre-smoking\nN=192", "0-30\nmin\nN=192", "31-90\nmin\nN=117",
               "91-180\nmin\nN=99", "181-210\nmin\nN=102", "211-240\nmin\nN=83",
               "241-270\nmin\nN=90",  "271+\nmin\nN=76")

plot_cutoffs(dataset=OF_THC, 
             timepoint_use_variable=OF$timepoint_use, 
             tissue="Oral Fluid", 
             cpd="THC", 
             x_labels=NULL)
## [[1]]

## 
## [[2]]
## # A tibble: 40 × 18
##       TP    FN    FP    TN detection_limit compound time_start time_stop
##    <dbl> <dbl> <int> <int> <fct>           <chr>         <dbl>     <dbl>
##  1     0     0    37   155 0.1             THC            -400         0
##  2     0     0    37   155 0.25            THC            -400         0
##  3     0     0    35   157 0.5             THC            -400         0
##  4     0     0    20   172 1               THC            -400         0
##  5     0     0    12   180 1.5             THC            -400         0
##  6   129     0    44    19 0.1             THC               0        30
##  7   129     0    44    19 0.25            THC               0        30
##  8   129     0    39    24 0.5             THC               0        30
##  9   129     0    30    33 1               THC               0        30
## 10   128     1    23    40 1.5             THC               0        30
## # ℹ 30 more rows
## # ℹ 10 more variables: time_window <fct>, NAs <int>, N <int>, N_removed <int>,
## #   Sensitivity <dbl>, Specificity <dbl>, PPV <dbl>, NPV <dbl>,
## #   Efficiency <dbl>, my_label <fct>

These cutoffs all seem very promising. Now we would like to find a way to quantify our findings by calculating the sensitivity and specificity for cutoff values in between 0 and 2.

output_OF = sens_spec_cpd_OFTHC(
                       dataset = OF,
                       cpd = "thc",
                       timepoints = timepoints_OF
                     )  |> clean_gluc()

output_OF_avg = average_sens_spec(output = output_OF)

output_OF_avg
## # A tibble: 101 × 4
##    compound detection_limit average_sensitivity average_specificity
##    <chr>              <dbl>               <dbl>               <dbl>
##  1 THC                 0                  0.956               0    
##  2 THC                 0.02               0.956               0.817
##  3 THC                 0.04               0.956               0.817
##  4 THC                 0.06               0.956               0.817
##  5 THC                 0.08               0.956               0.817
##  6 THC                 0.1                0.956               0.817
##  7 THC                 0.12               0.956               0.817
##  8 THC                 0.14               0.956               0.817
##  9 THC                 0.16               0.956               0.817
## 10 THC                 0.18               0.956               0.817
## # ℹ 91 more rows

Quickly plotting:

ss_OF_avg_together <-
  ss_plot_avg_together(output_OF_avg, tpts = length(unique(output_WB$time_start)), tissue = "Oral Fluid")

Now this is our result. The place where they intersect is the maximum sensitivity and specificity. Now we will get the specific value of this intersection.

output_OF_avg |>
  mutate(diff = abs(average_sensitivity-average_specificity)) |>
  arrange(diff)
## # A tibble: 101 × 5
##    compound detection_limit average_sensitivity average_specificity    diff
##    <chr>              <dbl>               <dbl>               <dbl>   <dbl>
##  1 THC                 0.82               0.893               0.890 0.00338
##  2 THC                 0.84               0.893               0.890 0.00338
##  3 THC                 0.86               0.893               0.890 0.00338
##  4 THC                 0.88               0.893               0.890 0.00338
##  5 THC                 0.9                0.893               0.890 0.00338
##  6 THC                 0.7                0.903               0.879 0.0240 
##  7 THC                 0.72               0.903               0.879 0.0240 
##  8 THC                 0.74               0.903               0.879 0.0240 
##  9 THC                 0.76               0.903               0.879 0.0240 
## 10 THC                 0.78               0.903               0.879 0.0240 
## # ℹ 91 more rows

At the cutoff range of 0.82 to 0.90, the difference between the average sensitivity and average specificity is minimized. We will settle with 0.85 for this value.

Results & Discussion

During our thorough exploratory data analysis, a discernible pattern emerged, highlighting the potency of the THC compound in effectively indicating recent marijuana usage. Consequently, our focus in the data analysis section specifically hones in on the sensitivity and specificity cutoff measurements within the Blood and Oral Fluid tissues. Notably, Oral Fluid consistently exhibits superior sensitivity across all time points. By scrutinizing the Receiver Operating Characteristic (ROC) curve comparing the THC measurements in Blood and Oral Fluid, a clear trend emerges – Oral Fluid surpasses Blood in accuracy. In simpler terms, Oral Fluid proves more adept at detecting recent marijuana joint usage compared to its Blood counterpart. This confirmation underscores the THC measurement in Oral Fluid as the paramount biomarker for recent use.

Moving forward, our focus shifts to determining the optimal cutoff values for both sensitivity and specificity. To achieve this, we calculate the average sensitivity and specificity across all time windows for various detection limits, identifying the point at which these metrics intersect – a key indicator of the optimal cutoff. Upon plotting the graph and meticulously examining the associated table, a convergence becomes evident at detection limits ranging from 0.82 to 0.90. Hence, it becomes apparent that the 0.82 to 0.90 detection limit of the THC compound in Oral Fluid stands out as the most effective biomarker for recent use.

Moreover, our exploration extends to a broader question. The visualizations shed light on the distinct responses of various compounds to marijuana dosage. Notably, we focus on key comparisons: CBG with THC in Blood, CBN with CBG in Oral Fluid, and CBG with THCV in Oral Fluid. The analysis of CBG with THC in Blood reveals a noteworthy observation – the coefficient of THC in the low dose group significantly exceeds that in the high dose group. This implies that, for the low dose group, each increment in THC correlates more strongly with CBG compared to the high dose group. Shifting attention to the comparison between CBN with CBG in Oral Fluid, the coefficient of CBG in the high dose group is notably higher. This suggests that as the dosage increases, a heightened correlation emerges between CBG and CBN. These nuanced insights deepen our understanding of compound interactions in response to varying marijuana doses.

The results of our case study on bio markers of recent marijuana use reveal intriguing insights into the choice of compound, matrix, and cutoff for effective detection. The overarching goal was to identify the most potent biomarker that, when combined with a specific matrix and cutoff, can reliably indicate recent marijuana use within a three-hour time frame.

Our analysis consistently points to THC in oral fluid as the optimal biomarker for recent marijuana use. The scatterplots depicting compound measurements over time clearly show a distinct separation between the placebo and THC treatment groups, particularly in oral fluid. The ROC curve analysis further supports this conclusion, with oral fluid THC demonstrating superior sensitivity across all time points compared to other matrices. The enhanced sensitivity in oral fluid makes it a compelling choice for detecting recent marijuana use accurately.

To determine the most effective cutoff for sensitivity and specificity, we conducted a thorough analysis, revealing that a cutoff value between 0.82 and 0.90 maximizes the balance between sensitivity and specificity. We opted for 0.85 as the optimal cutoff, considering the intersection point where the difference between average sensitivity and specificity is minimized. This ensures a practical and balanced approach for identifying recent marijuana use.

Limitations & Suggestions

There are some other possible ways of going about this. For example, someone who was recently high should be easily detectable by an officer. Therefore, a model wouldn’t really be necessary when predicting if someone had just recently smoked a joint.

Conclusion

In sum, the optimal bio marker for recent use is THC in ORal Fluid with a cutoff value of 0.85.

Extended Question

Extended Question: Wrangling

WB_long = WB |>
  pivot_longer(6:13, names_to = "compound")

OF_long = OF |>
  pivot_longer(6:12, names_to = "compound")

BR_long <- BR |> pivot_longer(6)

df_full <- bind_rows(WB_long, OF_long, BR_long)

The Pairplot That Started it All

OF = OF_ex
compounds_OF <-  as.list(colnames(Filter(function(x) !all(is.na(x)), OF[6:12])))

pairs(OF[,unlist(compounds_OF)], 
      pch=19, 
      cex=0.4, 
      cex.labels=0.6,
      labels=gsub('GLUC','gluc',gsub("_","-",toupper(colnames(OF[,unlist(compounds_OF)])))))

This is a pair plot showing the correlation between different compounds in th OF matrix. Notice how some of them look like two separate lines. This should be a clear sign that there’s some underlying variable that effects both of the compounds. After some trial and error, we found that this observation is due to the treatment variable - specifically, low dose and high dose group seems to have different slopes when it comes to the correlation between certain compounds.

Here’s a few of the more obvious ones:

line_colors <- c("chartreuse4", "cornflowerblue")

OF = OF_ex |>
  filter(treatment != "Placebo")

graph_1 <- ggplot(data = OF, mapping = aes(y = thc, x = cbn, color = treatment)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, aes(color = treatment)) +
  theme_minimal() +
  scale_color_manual(values = line_colors) +
  labs(title = "THC vs. CBN", 
       x = "CBN", 
       y = "THC")
  
      
graph_2 <- ggplot(data = OF, mapping = aes(y = thc, x = cbg, color = treatment)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", formula = y ~ x, aes(color = treatment)) +
  theme_minimal() +
  scale_color_manual(values = line_colors) +
  labs(title = "THC vs. CBG", 
       x = "CBG", 
       y = "THC")

gridExtra::grid.arrange(graph_1, graph_2, ncol = 2)

Also, we are removing the placebo group since it is not of interest.

This seems to show that low dose/high dose they change the correlation between chemicals. Interesting. For this report, we will narrow our target of investigation to THC vs CBG in OF.

If there is, in fact, some significant difference between compound correlation between different treatment, we should be able to construct some sort of model that, given a sample of the measurement of those two compounds, would be able to predict which treatment group this sample is pulled from to some meaningful degree of accuracy.

Let’s do some analysis.

Extended Question: EDA

First we try fitting a single variable linear model with THC as the value, and CBG as the explanatory variable.

lm_cbg = linear_reg() |>
  set_engine("lm") |>
  fit(thc ~ cbg, data = OF) 

lm_cbg
## parsnip model object
## 
## Fit time:  3ms 
## 
## Call:
## stats::lm(formula = thc ~ cbg, data = data)
## 
## Coefficients:
## (Intercept)          cbg  
##       60.30        14.35
glance(lm_cbg)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.850         0.849  446.     3581. 5.65e-263     1 -4782. 9570. 9583.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Here the resulting linear model is THC = 14.41CBG + 39.8 with an R^2 of 0.85.

Now we test if the addition of treatment significantly improves the model by doing multiple linear regression:

#multiple linear regression model with only main effect
mlm_cbg1 = linear_reg() |>
  set_engine("lm") |>
  fit(thc ~ cbg + treatment, data = OF) 

#multiple linear regression model with main effect + interaction effect
mlm_cbg2 = linear_reg() |>
  set_engine("lm") |>
  fit(thc ~ cbg * treatment, data = OF) 

mlm_cbg1
## parsnip model object
## 
## Fit time:  4ms 
## 
## Call:
## stats::lm(formula = thc ~ cbg + treatment, data = data)
## 
## Coefficients:
##                    (Intercept)                             cbg  
##                         -16.64                           14.45  
## treatment13.4% THC (high dose)  
##                         156.34
glance(mlm_cbg1)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.854         0.854  440.     1853. 2.41e-265     2 -4772. 9552. 9570.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
mlm_cbg2
## parsnip model object
## 
## Fit time:  3ms 
## 
## Call:
## stats::lm(formula = thc ~ cbg * treatment, data = data)
## 
## Coefficients:
##                        (Intercept)                                 cbg  
##                            48.4663                             11.6821  
##     treatment13.4% THC (high dose)  cbg:treatment13.4% THC (high dose)  
##                             0.5256                             12.9414
glance(mlm_cbg2)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.969         0.969  202.     6652.       0     3 -4277. 8563. 8585.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Adding in treatment as another explanatory variable and including interaction effects produces a noticeably more accurate model - with a R^2 of ~0.97.

Just to be thorough, let’s try another way. Here we fit a linear model for high does and low dose group separately.

OF_low <- OF |>
  filter(treatment == "5.9% THC (low dose)")

OF_high <- OF |>
  filter(treatment == "13.4% THC (high dose)")

#linear regression `cbn ~ cbg` for 5.9% THC (low dose) group
lr_cbg_low = linear_reg() |>
  set_engine("lm") |>
  fit(cbn ~ cbg, data = OF_low)

glance(lr_cbg_low)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.945         0.945  34.7     5641. 1.02e-208     1 -1638. 3282. 3293.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
#linear regression `cbn ~ cbg` for 13.4% THC (high dose) group
lr_cbg_high = linear_reg() |>
  set_engine("lm") |>
  fit(cbn ~ cbg, data = OF_high) 

glance(lr_cbg_high)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.913         0.913  67.8     3191. 3.01e-163     1 -1724. 3453. 3464.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Extended Question: Results & Discussion

We attempted four different ways of modeling the relationship between CBG and THC: directly fitting a linear model with CBG as the only explanatory variable, fitting a multiple linear model with both CBG and treatment as explanatory variables but only with main effect, multiple linear model with both CBG and treatment as explanatory variables with main effect and interaction effect, and finally fitting a linear model for the low dose and high dose group separately. Out of all these models, the multiple linear model with two explanatory variables and interaction effect proves to be the best model, explaining almost 97% of all observed data. This is quite impressive, especially considering that even the linear models that are fitted onto high dose and low dose group separately did not achieve this level of accuracy.

This result has both practical implications and potential to serve as grounf for further research. On the one hand, being able to accurate predict the value of another measurement based on one measurement and information about the dosage of THC can significantly lower the cost, time, and energy required for data collection. On the other hand, this predictive power of dosage in terms of how certain compounds relate to each other seem to suggest a non-linear relationship between dosage and certain compound’s value. Further research is required to investigate this proposition.


    1. Marijuana at Work: What Employers Need to Know. NSC. https://www.nsc.org/nsc-membership/marijuana-at-work#:~:text=According%20to%20a%20study%20reported,Decreased%20productivity
    ↩︎
  1. Research Note: Results of the 2013-2014 National Roadside Survey of Alcohol and Drug Use by Drivers. NHTSA. https://www.nhtsa.gov/sites/nhtsa.gov/files/812118-roadside_survey_2014.pdf↩︎

  2. Drug-Impaired Driving. NHTSA. https://www.nhtsa.gov/risky-driving/drug-impaired-driving#:~:text=In%202007%2C%20NHTSA’s%20National%20Roadside,in%20less%20than%2010%20years.↩︎